ECCV 2024 Tutorial:
Time is precious: Self-Supervised Learning Beyond Images
30th September, 09:00 to 13:00 CEST, Amber 7 + 8
MiCo Milano
Overview
Self-supervised learning (SSL) has allowed pretraining of neural networks to scale beyond the size of labelled datasets, demonstrating robust performance without the need for costly annotation. This approach has successfully scaled training dataset sizes to billions of images.
However, state-of-the-art (SoTA) models often learn representations limited to single-image inputs and therefore lack temporal context. Visual representations derived from static images are restricted to learning from disjoint snapshots of the world. This limitation is particularly pronounced in recent SSL techniques, most of which are trained on meticulously curated, object-centric datasets such as ImageNet. Attempts to scale single-image techniques to larger, less-curated datasets like Instagram-1B have not yielded substantial improvements in performance. A single image, regardless of artificial augmentation, has its constraints: it cannot create new perspectives of an object or anticipate unfolding events in a scene.

The primary goal of this tutorial is to introduce to the computer vision community the concept of learning robust representations by leveraging the rich information in video frames. While image-based pretraining has gained recent popularity with SimCLR, the practice of pretraining models from videos dates back much earlier. This tutorial will recapitulate both early and recent works that have pretrained image encoders using videos for different pretext tasks such as egomotion prediction, active recognition, and dense prediction. We also discuss practical implementation details relevant for practitioners and highlight connections to other existing works such as VITO, TimeTuning, DoRA, and V-JEPA. Finally, we discuss recent works that aim to mimic human visual systems, such as learning from one continuous video stream and learning from longitudinal audio-visual headcam recordings of young children, thereby placing this concept in a broader context.
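To make the core idea concrete, below is a minimal, hypothetical sketch (not the exact recipe of any work covered in the tutorial) of how two temporally separated frames from the same video can replace synthetic augmentations in a standard contrastive objective when pretraining an image encoder. The data loader, backbone, and hyperparameters are illustrative placeholders.

```python
# Illustrative sketch: pretrain an image encoder with an InfoNCE-style loss
# where two frames sampled a few seconds apart from the same video act as
# "natural augmentations" of each other. The loader, backbone, and
# hyperparameters below are hypothetical placeholders.
import torch
import torch.nn.functional as F
import torchvision

def info_nce(z1, z2, temperature=0.1):
    """Frame pairs from the same video are positives; other videos in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

encoder = torchvision.models.resnet50(num_classes=128)  # final fc acts as a projection head
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# `loader` is assumed to yield pairs of frames from the same clip, sampled dt seconds apart.
for frame_t, frame_t_plus_dt in loader:
    z1 = encoder(frame_t)            # embed frame at time t
    z2 = encoder(frame_t_plus_dt)    # embed frame at time t + dt
    loss = info_nce(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The only difference from a standard single-image contrastive setup is where the two views come from: the temporal offset within a video supplies viewpoint, lighting, and object-motion changes that synthetic crops and color jitter cannot.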
Motivation
These works from the last 6-12 months demonstrate a paradigm shift, showing that SSL models pretrained on videos can outperform image-based pretraining. This notable progress validates the trend and makes a tutorial on past and present advancements crucial for newcomers to the field. Key questions we aim to tackle in this tutorial include:
- Can we learn strong image encoders from good quality videos (i.e. with limited data)?
- Do we need synthetic augmentations? How useful are the natural augmentations in videos?
- Can we learn from a continuous stream similar to humans?
Speakers
- Shashanka (INRIA)
- Mohammadreza (University of Amsterdam)
- Yuki M. Asano (University of Amsterdam)
- João Carreira (Google DeepMind)
- Ishan Misra (GenAI, Meta)
- Emin Orhan (Independent Researcher)
Schedule
| Title | Speaker | Slides | Talk |
|---|---|---|---|
| Introduction | Mohammadreza | Slides | Talk |
| Part (1): Learning image encoders from videos: prior works | Shashanka | Slides | Talk |
| Part (2): New vision foundation models from video(s): 1-video pretraining, tracking image patches | Yuki M. Asano | Slides | Talk |
| Coffee Break | | | |
| Applications (1): Learning from one continuous stream: single-stream continual learning, massively parallel video models, perceivers | João Carreira | Slides | Talk |
| Applications (2): What makes generative video models tick? Emu Video (text-to-video), FlowVid (video-to-video), factorizing text-to-video generation, efficiency | Ishan Misra | | |
| Applications (3): SSL from the perspective of a developing child: audio-visual dataset, development of early word learning, learning from children | Emin Orhan | Slides | Talk |
| Conclusion, Open Problems & Final remarks | Yuki M. Asano | | |
About Us
Shashanka is a final-year PhD student in the LinkMedia team at INRIA, France, advised by Yannis Avrithis. He conducts research on self-supervised learning, specifically on learning image representations from videos and on data-augmentation methods. He has organized several deep learning workshops at his university covering a broad range of topics, including diffusion models, RAG, and backdoor attacks.
Mohammadreza is a third-year PhD student at the QUVA lab, University of Amsterdam, advised by Yuki Asano, Cees Snoek, and Efstratios Gavves. His research focuses on representation learning, with a special emphasis on learning image representations from videos. In addition to his primary research, he is deeply engaged in machine learning safety, working towards ensuring that AI systems are reliable and safe for society.
Yuki M. Asano is an assistant professor at the Video & Image Sense (VIS) Lab at the University of Amsterdam and leads the Qualcomm-UvA (QUVA) lab. He conducts research on self-supervised learning, multi-modal learning, and augmentations, which has resulted in works such as GDT, SSB, SeLa, and single-image pretraining, and most recently self-supervised learning from videos with TimeTuning and DoRA. He has served as Area Chair for CVPR 2022-2024, ICLR 2023, and NeurIPS 2022. He has also organized several workshops, such as SSLWIN at ECCV 2020 and ECCV 2022, and BigMAC at ICCV 2023.